Data Review/Exploration

There are different places to start with this project. The first thing I want to do is extract the number of stores I will be forecasting reservations for, since the given data covers more restaurants than actually need visitors forecasted. The sample submission appends the date to the end of the id, so first I need to remove that and then count the number of unique stores.

forecast_stores <- str_sub(sample_submission$id, 1, 20) %>% 
  unique()

air_resv_unique <- unique(air_reserve$air_store_id)
hpg_resv_unique <- unique(hpg_reserve$hpg_store_id)
air_rel_unique <- unique(store_id_relation$air_store_id)
air_visit_unique <- unique(air_visit_data$air_store_id)
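The id layout can be checked on a hypothetical id in the competition's format (a 20-character store id, an underscore, then a 10-character date):

```r
# Hypothetical submission id: 20-character store id, '_', then the visit date
id <- "air_00a91d42b08b08d9_2017-04-23"
substr(id, 1, 20)   # "air_00a91d42b08b08d9" -- the store id portion
substr(id, 22, 31)  # "2017-04-23" -- the date portion
```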

We will need to forecast visits for 821 unique stores. We have 314 unique store IDs in the air reservations dataset and 13325 unique store IDs in the hpg reservations dataset. Despite the large amount of hpg reservation information, the relational store table contains only 150 unique air IDs. Lastly, the actual visitor data has 829 unique store ids. What does all of this mean?

Well, this means that not all stores take reservations. So when I make my forecasts I will only be able to use reservation data for some of the stores. Why is it the case that not all stores take reservations?

We have information on all of the stores in terms of what types of restaurants there are by genre, name and location. Below is a summary by genre for the air dataset.

# Genre Breakdown

air_store_info %>% 
  select(air_genre_name) %>% 
  group_by(air_genre_name) %>% 
  summarise(Count = n()) %>% 
  arrange(desc(Count)) %>% 
  kable(caption = 'air Store Genres',
        col.names = c('Genre Name', 'Count')) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
air Store Genres
Genre Name Count
Izakaya 197
Cafe/Sweets 181
Dining bar 108
Italian/French 102
Bar/Cocktail 79
Japanese food 63
Other 27
Yakiniku/Korean food 23
Western food 16
Okonomiyaki/Monja/Teppanyaki 14
Creative cuisine 13
Asian 2
International cuisine 2
Karaoke/Party 2

Some of the stores are bars (including izakayas) and some are cafes; we wouldn’t expect these places to take reservations, and they make up a sizable portion of the dataset. For places that don’t take reservations, we will have to rely on more stochastic methods, while places with reservations can use the additional information.

The hpg store genre breakdown is given below.

hpg_store_info %>% 
  select(hpg_genre_name) %>% 
  group_by(hpg_genre_name) %>% 
  summarise(Count = n()) %>% 
  arrange(desc(Count)) %>% 
  kable(caption = 'hpg Store Genres',
        col.names = c('Genre Name', 'Count')) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
hpg Store Genres
Genre Name Count
Japanese style 1750
International cuisine 700
Creation 410
Seafood 339
Grilled meat 325
Italian 249
Spain Bar/Italian Bar 229
Chinese general 91
Japanese food in general 85
Japanese cuisine/Kaiseki 64
Creative Japanese food 60
Karaoke 60
Shabu-shabu/Sukiyaki 59
Okonomiyaki/Monja/Teppanyaki 44
Party 40
Korean cuisine 38
French 27
Steak/Hamburger/Curry 24
Bistro 22
Cafe 16
Sushi 11
Pasta/Pizza 10
Bar/Cocktail 7
Amusement bar 5
Thai/Vietnamese food 5
Western food 5
Cantonese food 4
Sichuan food 3
Dim Sum/Dumplings 2
Sweets 2
Shanghai food 1
Spain/Mediterranean cuisine 1
Taiwanese/Hong Kong cuisine 1
Udon/Soba 1

With the hpg-air relational dataframe, we can use the hpg reservation/visit information to help predict the air store visits, which are the ones that require forecasting for the submission.

Reservations Exploration

Using air_reserve

Let’s explore how many visitors have reserved ahead of their visit. All of the following graphs group together all of the stores to see overall trends in the data.

# What days are our reservations made for?
air_reserve %>%
  group_by(Week = format(as.Date(visit_datetime), "%Y-%W-%m")) %>% 
  summarise(Visitors = sum(reserve_visitors)) %>% 
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title = 'air Expected Visits by Week')
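A note on the grouping label: "%Y-%W-%m" is the year, then the week of the year, then the month, so although the labels look like dates they are really year-week-month tags that sort chronologically within a year. For example:

```r
# %W is the week of the year (Monday as the first day of the week),
# so this label reads year-week-month, not year-month-day
format(as.Date("2017-01-15"), "%Y-%W-%m")  # "2017-02-01"
```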

We can see that there was a big spike in the months leading up to the New Year, and we have weeks with peaks afterwards. Our forecasts will cover visits from the last week of April up until the end of May. We have some reservations for this period already (which includes the “Golden Week” in Japan), as can be seen above, and we can contrast this with when those reservations were made, below.

# When are our reservations made?
air_reserve %>%
  group_by(Week = format(as.Date(reserve_datetime), "%Y-%W-%m")) %>% 
  summarise(Visitors = sum(reserve_visitors)) %>% 
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title = 'air Reservations Made by Week')

Thus, a lot of our visits will take place during Golden Week, and we were only given some of the reservations made for our forecast period.

Using hpg_reserve

Let’s take a look at the data from hpg.

hpg_reserve %>%
  group_by(Week = format(as.Date(visit_datetime), "%Y-%W-%m")) %>% 
  summarise(Visitors = sum(reserve_visitors)) %>% 
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title = 'hpg Expected Visits by Week')

hpg_reserve %>%
  group_by(Week = format(as.Date(reserve_datetime), "%Y-%W-%m")) %>% 
  summarise(Visitors = sum(reserve_visitors)) %>% 
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title = 'hpg Reservations Made by Week')

The hpg data tells a similar story to the air_reserve data.

Visitor Data

Let’s see how many actual visitors we have in the different stores.

# How many visitors have we actually had
air_visit_data %>% 
  group_by(Week = format(as.Date(visit_date), "%Y-%W-%m")) %>% 
  summarise(Visitors = sum(visitors)) %>% 
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title = 'Actual Visitors')

This includes visits among those places which are not reservation based, so we will have to sift those out by merging the store info and actual visitor dataframes. Otherwise we can see that the peaks remain around 100k for actual store visitors.

# I can add the store info to the air visit data
air_visit_info <- merge(air_visit_data, air_store_info, by = 'air_store_id')
rm(air_visit_data)
air_visit_info %>% 
  group_by(Week = format(as.Date(visit_date), "%Y-%W-%m"), Genre = air_genre_name) %>% 
  summarise(Visitors = sum(visitors)) %>% 
  ggplot(aes(x = Week, y = Visitors, color = Genre, group = Genre)) + geom_line() + 
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(14)) + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title = 'Actual Visitors')

This gives us a better breakdown of actual visitors. We can see that the most popular genre is the Izakaya, which was anticipated from looking at the genre breakdown, followed by Cafe/Sweets, Italian/French, and then the Dining bar. I imagine dining bars take reservations, as would Italian/French places, so places with reservations will probably be on the smaller side of visitor traffic.

This is without the hpg merger however.

Full Dataset Creation/Exploration

Cleaning

First I will clean up the air_reserve dataset. In this instance, I want the total number of people that are expected to visit each store on a given date; this means aggregating the data and keeping just the date portion of the datetime.

# I just care about expected visitors on a certain day, not the time nor necessarily the reservation information
air_reserve <- air_reserve %>% 
  mutate(visit_day = as.Date((visit_datetime))) %>%
  select(-c(reserve_datetime, visit_datetime))

Next I clean up the hpg reserve dataset to prepare it for a merger with the air_reserve data.

# At least for the first go around in my predictions, I just want information related to the air stores.

# First I match up the hpg id with the air id
row.names(store_id_relation) <- store_id_relation[, 2]
hpg_reserve$air_store_id <- store_id_relation[hpg_reserve$hpg_store_id, 1]

# Next I keep just the observations with the air stores
hpg_reserve <- hpg_reserve %>% 
  drop_na() %>% 
  mutate(visit_day = as.Date((visit_datetime))) %>%
  select(-c(reserve_datetime, visit_datetime, hpg_store_id))
# I want to have the columns match up to bind properly and I want to change the column order
air_reserve <- air_reserve[, c(1,3,2)]
hpg_reserve <- hpg_reserve[, c(2,3,1)]
air_reserve <- rbind(air_reserve, hpg_reserve)
rm(hpg_reserve)

# Put the reservation and store information into one single dataframe
air_reserve_info <- merge(air_reserve, air_store_info, by = 'air_store_id')
rm(air_reserve)

# I want one aggregate visitor amount per visiting day
air_reserve_info <- aggregate(reserve_visitors ~ air_store_id + 
                                visit_day + air_genre_name + 
                                air_area_name + 
                                latitude + 
                                longitude, air_reserve_info, FUN = sum)
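The rownames trick used above to map hpg ids onto air ids acts like a keyed lookup: ids missing from the relation table come back as NA, which is why drop_na() follows. A toy illustration:

```r
# Toy version of the rownames lookup: 'relation' maps hpg ids to air ids
relation <- data.frame(air_store_id = c("air_a", "air_b"),
                       hpg_store_id = c("hpg_1", "hpg_2"),
                       stringsAsFactors = FALSE)
rownames(relation) <- relation$hpg_store_id
hpg <- data.frame(hpg_store_id = c("hpg_2", "hpg_9"), stringsAsFactors = FALSE)
# Index by rowname; "hpg_9" has no match and yields NA
hpg$air_store_id <- relation[hpg$hpg_store_id, "air_store_id"]
hpg$air_store_id  # "air_b" NA
```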

Now that I have reserve data, I want to add to it the actual visitor data and make one large dataset.

# First I aggregate the visitor info
air_visit_info <- aggregate(visitors ~ air_store_id + 
                                visit_date + air_genre_name + 
                                air_area_name + 
                                latitude + 
                                longitude, air_visit_info, FUN = sum)

air_visit_info$visit_day <- air_visit_info$visit_date

# Keep air_store_id, visitors, and visit_day (columns 1, 7, 8) from the visit data
air_full <- merge(air_reserve_info, air_visit_info[, c(1,7,8)], by = c('air_store_id', 'visit_day'))
rm(air_visit_info, air_reserve_info)
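Both aggregation steps rely on aggregate()'s formula interface, which collapses duplicate key combinations into a single summed row; a minimal example:

```r
# Two rows for store "a" on day "d1" collapse into one summed row
d <- data.frame(air_store_id = c("a", "a", "b"),
                visit_day = c("d1", "d1", "d1"),
                visitors = c(2, 3, 5))
agg <- aggregate(visitors ~ air_store_id + visit_day, data = d, FUN = sum)
agg
#   air_store_id visit_day visitors
# 1            a        d1        5
# 2            b        d1        5
```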

Before I look at the reservations and actual visitation patterns, I will first add in the holiday data.

# first I will add in the same column name as the one in the full dataset.
date_info$visit_day <- as.Date(date_info$calendar_date)
date_info <- date_info[, c(4,2,3)]
air_full <- merge(air_full, date_info, by = 'visit_day')
rm(date_info)

Reservations/Visits by Genre

Let’s see how well the reservation data matches the visitor data.

air_full$pct_reserve <- air_full$reserve_visitors/air_full$visitors
  
air_full %>% 
  ggplot(aes(factor(air_genre_name), pct_reserve)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90)) + coord_flip() +
  labs(x = 'Genre', y = 'Percent Reserve',
       title = 'Boxplot of Percent of Visitors that Reserved vs Genre')

As we can see, there are some serious outliers in the data. Let’s trim the data down and see if we can establish a relationship then. As it stands, it seems like there are a lot more reservations for the air places of interest than there are actual visitors.

air_full %>% 
  group_by(air_genre_name) %>% 
  filter(!(abs(pct_reserve - median(pct_reserve)) > 2*sd(pct_reserve))) %>%
  ggplot(aes(factor(air_genre_name), pct_reserve)) + geom_violin(trim = TRUE, scale = 'width') +
  geom_boxplot(width = 0.15) +
  theme(axis.text.x = element_text(angle = 90)) + coord_flip() +
  labs(x = 'Genre', y = 'Percent Reserve',
       title = 'Violin and Boxplot of Percent of Visitors that Reserved\nvs Genre, Trimmed')

This looks much better. We can still see that quite a few places get more reservations than visitors; these are places like cafes, bars, and dining bars. I would have thought it would be the other way around, with reservations making up a small percentage of total visitors for the day. This might be because these places fill up quickly, or maybe the system takes multiple reservations; I’m not entirely sure. I’ve also used violin plots to show that the median percentage of reservations does usually fall within a normal range, i.e., below 100%, while the mean can be pulled one way or the other.

Reservations/Visits by Day of the Week

Let’s also take a look at the breakdown of our data by day of the week. The graph below already trims some of the outliers.

air_full %>% 
  group_by(day_of_week) %>% 
  filter(!(abs(pct_reserve - median(pct_reserve)) > 2*sd(pct_reserve))) %>%
  ggplot(aes(factor(day_of_week), pct_reserve)) + geom_violin(trim = TRUE, scale = 'width') +
  geom_boxplot(width = 0.15) +
  labs(x = 'Day of the Week', y = 'Percent Reserve',
       title = 'Violin and Boxplot of Percent of Visitors that Reserved\nper Day of the Week')

The plots show that the means are pretty similar for most days of the week. We can see that the medians are lower for Saturday and Sunday, but in general all of these guitar-looking violins reveal that, modally, most places have more visitors than reservations throughout the different days of the week.

Modeling

When it comes to modeling, the ask is going to be a little tough. This is because the submission requires forecasting visitors for every store for 39 days. The full dataset that I am working with is a panel dataset, so forecasting ahead of the panel will be somewhat tricky. For instance, I would need to look at an ACF for each of the many store ids in order to see which stores have autocorrelation and which ones don’t.

Additionally, for any linear models, I will have to use a Box-Cox transformation since the data is non-normal (the number of visitors cannot be negative).
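As a reminder, the Box-Cox family is a simple power transform (hand-rolled sketch below; forecast::BoxCox() is the real implementation):

```r
# Box-Cox: (y^lambda - 1)/lambda, with log(y) as the lambda -> 0 limit
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
box_cox(c(1, 4, 100), lambda = 0.5)  # 0 2 18
```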

What’s my approach? Well, since I did all of the hard work to get here, I might as well try to forecast the overall level of visitors and then use the mean of that as my prediction. Otherwise, I would have gone with individual predictions for each store id, and I can try that using a fixed-effects panel.

I will break down the visitor data into 5 genres: Izakaya, Cafe/Sweets, Dining bar, Italian/French, and all others. Although I want to use the reservations to help with my predictions, I think it might be better to go without, since I do not have a separate set of predictors for visitor reservations and visitors. This means the predictors used to model the number of reservations would also be used to model the number of visitors, and this two-step regression might not do so well.

Izakayas

First I want to compare the overall Izakaya visits with the aggregate visits that we would observe with the sample.

izak <- air_full[air_full$air_genre_name=='Izakaya',]
izak.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = izak)
autoplot(ts(izak.ts$visitors)) +
  labs(title = 'All Data Izakaya',
       x = 'Time',
       y = 'Visitors')

izak.sample <- izak[which(izak$air_store_id %in% forecast_stores),]
izak.sample.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = izak.sample)
autoplot(ts(izak.sample.ts$visitors)) +
  labs(title = 'Sample Submission Izakaya',
       x = 'Time',
       y = 'Visitors')

To forecast this time series, I will try ETS and Fourier methods on the sample submission aggregates after creating a train/test split.

# The frequency is set to 7 to represent the weekly fluctuations
izak.train <- ts(izak.sample.ts[1:round(nrow(izak.sample.ts)*0.90), "visitors"], start = 1, frequency = 7)
izak.test <- ts(izak.sample.ts[round(nrow(izak.sample.ts)*0.9):nrow(izak.sample.ts), "visitors"],
                start = end(izak.train), frequency = 7)

I used a 90/10 split because it produces a similar number of periods to those we have to forecast: we have to forecast 39 days out, and the test sample has 49 days.
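The split convention can be seen on a toy weekly series: start = end(train) keeps the test set on the same weekly cycle, and, as with the round(n * 0.9) index above, the split point itself lands in both series:

```r
# Toy daily series with a weekly cycle (frequency = 7)
x <- 1:20
train <- ts(x[1:16], start = 1, frequency = 7)
test  <- ts(x[16:20], start = end(train), frequency = 7)
end(train)   # 3 2 -- cycle 3, day 2
start(test)  # 3 2 -- the same point, so one observation is shared
```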

ETS Forecast

First I will take a look at the ETS model. The time series is given a seasonal frequency of 7, since we have a weekly pattern in the dataset.

izak.ets <- ets(izak.train, allow.multiplicative.trend = T, model = "MAM")
izak.ets.forecast <- forecast(izak.ets, h = 49)
autoplot(izak.ets.forecast, PI = F) + 
  labs(y = 'Visitors')

accuracy(izak.ets.forecast,
         x = izak.test) %>% knitr::kable(caption = 'ETS Accuracy Output') %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
ETS Accuracy Output
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -34.671107 372.9731 186.0986 -22.798045 37.36347 0.8521535 0.2984924 NA
Test set 9.726515 332.6870 276.9076 -3.166493 19.97433 1.2679716 0.6495821 0.4546517

The ETS provides us with a baseline. We have a model with multiplicative errors, additive trend, and multiplicative seasonality.
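For reference, the simpler columns in these accuracy tables come straight from the raw forecast errors (sketch with toy numbers; accuracy() additionally reports MASE, which scales MAE by the in-sample seasonal naive MAE):

```r
# Basic accuracy metrics from toy actual/forecast vectors
actual <- c(100, 120, 90, 110)
fc     <- c(95, 125, 100, 105)
e <- actual - fc
metrics <- c(ME   = mean(e),
             RMSE = sqrt(mean(e^2)),
             MAE  = mean(abs(e)),
             MAPE = 100 * mean(abs(e / actual)))
round(metrics, 2)  # ME -1.25, RMSE 6.61, MAE 6.25, MAPE 6.21
```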

Holt-Winters

Although the Holt-Winters model is technically cycled through when ets() is called, it might perform well on the test set, so it is worth taking a look.

izak.hw <- hw(izak.train, seasonal = 'multiplicative', h = 49)
autoplot(izak.hw, PI = F)

accuracy(izak.hw,
         x = izak.test) %>% knitr::kable(caption = 'Holt-Winters Accuracy Output') %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
Holt-Winters Accuracy Output
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -40.21141 346.0954 199.7395 -24.0191161 40.84157 0.9146155 0.3098625 NA
Test set 99.05261 398.4692 314.1756 0.8692316 20.50341 1.4386234 0.3577733 0.5193806

It does not perform better than the ets() on the test metrics, so it will not be considered.

Fourier

Hyndman recommends trying a Fourier model. I first set the frequency of the sample data to 365 so that it is treated as a daily series with annual seasonality.

izak.train <- ts(izak.sample.ts[1:round(nrow(izak.sample.ts)*0.90), "visitors"], frequency = 365)
izak.test <- ts(izak.sample.ts[round(nrow(izak.sample.ts)*0.9):nrow(izak.sample.ts), "visitors"],
                start = end(izak.train), frequency = 365)
# Cycle through different models and select the best fit
min <- Inf
best <- 0
for (k in 1:182) {
  fit <- tslm(izak.train ~ poly(trend,2) + fourier(izak.train, K = k))
  aic <- as.numeric(CV(fit)[2])
  if (aic < min) {
    min <- aic
    best <- k
  }
}

# Plot the bestfit model
bestfit <- tslm(izak.train ~ poly(trend, 2) + fourier(izak.train, K = best))
autoplot(izak.train, 
         main = 'Fourier on Training Data',
         ylab = 'Visitors')+
  autolayer(bestfit$fitted.values,
            series = as.character(best), color = 'red')
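For intuition, fourier(izak.train, K = k) supplies k sine/cosine pairs at harmonics of the annual period as regressors; a minimal base-R equivalent (a hypothetical helper, not the forecast package's implementation):

```r
# k-th harmonic of period m evaluated at times t, as a sin/cos column pair
fourier_terms <- function(t, m, K) {
  do.call(cbind, lapply(seq_len(K), function(k)
    cbind(sin(2 * pi * k * t / m), cos(2 * pi * k * t / m))))
}
X <- fourier_terms(t = 1:730, m = 365, K = 3)
dim(X)  # 730 6 -- one sin/cos pair per harmonic
```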

The Fourier model seems to do really well on the training data; let’s see the forecasts and the test results.

fourier.fc <- forecast(bestfit, 
         newdata = data.frame(fourier(izak.train, K = best, h = 49)))

autoplot(fourier.fc, PI = F,
         main = 'Fourier Forecasts',
         ylab = 'Visitors')

The forecasts don’t end up looking too good, at least when compared with ets() or Holt-Winters hw().

accuracy(fourier.fc, izak.test) %>% knitr::kable(caption = 'Fourier Accuracy Output') %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
Fourier Accuracy Output
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set 0.0000 204.1248 138.6950 -3.525013 29.26996 0.1518619 -0.1607332 NA
Test set 182.3367 693.2873 522.2779 -4.863577 33.04589 0.5718599 0.2864045 0.9331468

The Fourier model has the best test MASE of the three models I tried. I would have gone with the Fourier model if I anticipated the out-of-sample data to look more like the sample data, but I have a strong suspicion that this is not the case. Remember, we are expecting Golden Week to take place, so it is likely that the multiplicative nature of the data will continue, and the Fourier model doesn’t capture that. Thus, going by RMSE, I will use ets() for the Izakaya forecasts.

sample_submission$air_store_id <- str_sub(sample_submission$id, 1, 20)
sample_submission$date <- str_sub(sample_submission$id, 22, 31)
# Attach each store's genre (columns 1 and 2 of air_store_info are the id and genre)
sample_submission <- merge(sample_submission, air_store_info[, c(1,2)], by=c('air_store_id'), all.x = T, all.y = F)
# Refit on the full sample series and forecast the 39 submission days;
# the upper 95% interval is divided evenly across the genre's stores
izak.ets.full <- ets(ts(izak.sample.ts["visitors"], start = 1, frequency = 7))
izak.fc <- forecast(izak.ets.full, h = 39)

izak.forecast <- data.frame(date = sample_submission$date[1:39],
                            fc = as.numeric(izak.fc$upper[,2]) /
                              nrow(air_store_info[air_store_info$air_genre_name=='Izakaya',]))
for (i in 1:nrow(sample_submission)) {
  if (sample_submission[i, 5] == 'Izakaya') {
    sample_submission[i, 'visitors'] = izak.forecast[izak.forecast$date == sample_submission[i,4], 2]
  }
}
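As an aside, the row-wise fill loop above can be vectorized with match(); a sketch on hypothetical toy frames:

```r
# Fill one genre's forecasts by matching on date instead of looping over rows
sub <- data.frame(date  = c("d1", "d2", "d1"),
                  genre = c("Izakaya", "Izakaya", "Cafe/Sweets"),
                  visitors = NA_real_,
                  stringsAsFactors = FALSE)
fc <- data.frame(date = c("d1", "d2"), fc = c(12.5, 14.0), stringsAsFactors = FALSE)
idx <- sub$genre == "Izakaya"
sub$visitors[idx] <- fc$fc[match(sub$date[idx], fc$date)]
sub$visitors  # 12.5 14.0 NA
```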

Cafe/Sweets

cafe <- air_full[air_full$air_genre_name=='Cafe/Sweets',]
cafe.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = cafe)
autoplot(ts(cafe.ts$visitors)) +
  labs(title = 'All Data Cafe/Sweets',
       x = 'Time',
       y = 'Visitors')

cafe.sample <- cafe[which(cafe$air_store_id %in% forecast_stores),]
cafe.sample.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = cafe.sample)
autoplot(ts(cafe.sample.ts$visitors)) +
  labs(title = 'Sample Submission Cafe/Sweets',
       x = 'Time',
       y = 'Visitors')

I will also use the cafe sample data since it tracks the overall data pretty well.

# The frequency is set to 7 to represent the weekly fluctuations
cafe.train <- ts(cafe.sample.ts[1:round(nrow(cafe.sample.ts)*0.90), 
                                "visitors"], 
                 start = 1, frequency = 7)
cafe.test <- ts(cafe.sample.ts[round(nrow(cafe.sample.ts)*0.9):nrow(cafe.sample.ts),
                               "visitors"], 
                start = end(cafe.train), frequency = 7)

ETS Forecasting

Let’s use ETS again for our Cafe/Sweets data.

cafe.ets <- ets(cafe.train, allow.multiplicative.trend = T, model = "MAM")
cafe.ets.forecast <- forecast(cafe.ets, h = 45)
autoplot(cafe.ets.forecast, PI = F) + 
  labs(y = 'Visitors')

accuracy(cafe.ets.forecast,
         x = cafe.test) %>% knitr::kable(caption = 'ETS Accuracy Output') %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
ETS Accuracy Output
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -8.151733 71.15894 49.84128 -132.75550 157.25948 1.038857 0.1480382 NA
Test set 22.652709 116.42641 87.99563 -15.24103 45.12135 1.834119 0.4771851 0.9360908

We have our ETS baseline.

Fourier

This time I will not consider Holt-Winters, and jump straight to the Fourier model.

cafe.train <- ts(cafe.sample.ts[1:round(nrow(cafe.sample.ts)*0.90), "visitors"], frequency = 365)
cafe.test <- ts(cafe.sample.ts[round(nrow(cafe.sample.ts)*0.9):nrow(cafe.sample.ts), "visitors"],
                start = end(cafe.train), frequency = 365)
# Cycle through different models and select the best fit
min <- Inf
best <- 0
for (k in 1:182) {
  fit <- tslm(cafe.train ~ poly(trend,2) + fourier(cafe.train, K = k))
  aic <- as.numeric(CV(fit)[2])
  if (aic < min) {
    min <- aic
    best <- k
  }
}

# Plot the bestfit model
bestfit <- tslm(cafe.train ~ poly(trend, 2) + fourier(cafe.train, K = best))
autoplot(cafe.train, 
         main = 'Fourier on Training Data',
         ylab = 'Visitors')+
  autolayer(bestfit$fitted.values,
            series = as.character(best), color = 'red')

The Fourier model seems to do really well on the training data; let’s see the forecasts and the test results.

fourier.fc <- forecast(bestfit, 
         newdata = data.frame(fourier(cafe.train, K = best, h = 45)))

autoplot(fourier.fc, PI = F,
         main = 'Fourier Forecasts',
         ylab = 'Visitors')

accuracy(fourier.fc, cafe.test) %>% knitr::kable(caption = 'Fourier Accuracy Output') %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
Fourier Accuracy Output
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set 0.00000 44.98388 29.07108 -26.58356 68.39328 0.1556562 0.0783414 NA
Test set 29.74226 189.85987 152.43916 -46.84700 85.92032 0.8162097 0.3660412 1.526767

We get a similar story with our Cafe/Sweets predictions.

cafe.ets.full <- ets(ts(cafe.sample.ts["visitors"], start = 1, frequency = 7))
cafe.fc <- forecast(cafe.ets.full, h = 39)

cafe.forecast <- data.frame(date = sample_submission$date[1:39], fc = as.numeric(cafe.fc$upper[,2])/nrow(air_store_info[air_store_info$air_genre_name=='Cafe/Sweets',]))
for (i in 1:nrow(sample_submission)) {
  if (sample_submission[i, 5] == 'Cafe/Sweets') {
    sample_submission[i, 'visitors'] = cafe.forecast[cafe.forecast$date == sample_submission[i,4], 2]
  }
}

Dining Bar

dining <- air_full[air_full$air_genre_name=='Dining bar',]
dining.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = dining)
autoplot(ts(dining.ts$visitors)) +
  labs(title = 'All Data Dining Bar',
       x = 'Time',
       y = 'Visitors')

dining.sample <- dining[which(dining$air_store_id %in% forecast_stores),]
dining.sample.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = dining.sample)
autoplot(ts(dining.sample.ts$visitors)) +
  labs(title = 'Sample Submission Dining Bar',
       x = 'Time',
       y = 'Visitors')

I will also use the dining sample data since it tracks the overall data pretty well.

# The frequency is set to 7 to represent the weekly fluctuations
dining.train <- ts(dining.sample.ts[1:round(nrow(dining.sample.ts)*0.90), 
                                "visitors"], 
                 start = 1, frequency = 7)
dining.test <- ts(dining.sample.ts[round(nrow(dining.sample.ts)*0.9):nrow(dining.sample.ts),
                               "visitors"], 
                start = end(dining.train), frequency = 7)

ETS Forecasting

This time I will just stick to ETS as my forecasting method of choice.

dining.ets <- ets(dining.train, allow.multiplicative.trend = T, model = "MAM")
dining.ets.forecast <- forecast(dining.ets, h = 45)
autoplot(dining.ets.forecast, PI = F) + 
  labs(y = 'Visitors')

accuracy(dining.ets.forecast,
         x = dining.test) %>% knitr::kable(caption = 'ETS Accuracy Output') %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
ETS Accuracy Output
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -14.87462 155.4840 91.01057 -67.066315 89.63027 0.8978197 0.3821467 NA
Test set 18.53166 202.2967 152.40547 -4.032356 29.45463 1.5034808 0.3749917 0.7581085

dining.ets.full <- ets(ts(dining.sample.ts["visitors"], start = 1, frequency = 7))
dining.fc <- forecast(dining.ets.full, h = 39)

dining.forecast <- data.frame(date = sample_submission$date[1:39], fc = as.numeric(dining.fc$upper[,2])/nrow(air_store_info[air_store_info$air_genre_name=='Dining bar',]))
for (i in 1:nrow(sample_submission)) {
  if (sample_submission[i, 5] == 'Dining bar') {
    sample_submission[i, 'visitors'] = dining.forecast[dining.forecast$date == sample_submission[i,4], 2]
  }
}

Italian/French

itfr <- air_full[air_full$air_genre_name=='Italian/French',]
itfr.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = itfr)
autoplot(ts(itfr.ts$visitors)) +
  labs(title = 'All Data Italian/French',
       x = 'Time',
       y = 'Visitors')

itfr.sample <- itfr[which(itfr$air_store_id %in% forecast_stores),]
itfr.sample.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = itfr.sample)
autoplot(ts(itfr.sample.ts$visitors)) +
  labs(title = 'Sample Submission Italian/French',
       x = 'Time',
       y = 'Visitors')

I will also use the Italian/French sample data since it tracks the overall data pretty well.

# The frequency is set to 7 to represent the weekly fluctuations
itfr.train <- ts(itfr.sample.ts[1:round(nrow(itfr.sample.ts)*0.90), 
                                "visitors"], 
                 start = 1, frequency = 7)
itfr.test <- ts(itfr.sample.ts[round(nrow(itfr.sample.ts)*0.9):nrow(itfr.sample.ts),
                               "visitors"], 
                start = end(itfr.train), frequency = 7)

ETS Forecasting

Again I will stick to ETS as my forecasting model of choice.

itfr.ets <- ets(itfr.train, allow.multiplicative.trend = T, model = "MAM")
itfr.ets.forecast <- forecast(itfr.ets, h = 49)
autoplot(itfr.ets.forecast, PI = F) + 
  labs(y = 'Visitors')

accuracy(itfr.ets.forecast,
         x = itfr.test) %>% knitr::kable(caption = 'ETS Accuracy Output') %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
ETS Accuracy Output
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -3.471546 145.9475 78.20574 -25.298070 45.03443 0.7787832 0.3850765 NA
Test set 82.516461 147.3630 124.20539 8.809987 16.34202 1.2368538 0.2630564 0.5240515

The model seems to fit well; by that I mean the RMSE of the training and test sets are very similar, so I would expect this prediction to be robust.

itfr.ets.full <- ets(ts(itfr.sample.ts["visitors"], start = 1, frequency = 7))
itfr.fc <- forecast(itfr.ets.full, h = 39)

itfr.forecast <- data.frame(date = sample_submission$date[1:39], fc = as.numeric(itfr.fc$upper[,2])/nrow(air_store_info[air_store_info$air_genre_name=='Italian/French',]))
for (i in 1:nrow(sample_submission)) {
  if (sample_submission[i, 5] == 'Italian/French') {
    sample_submission[i, 'visitors'] = itfr.forecast[itfr.forecast$date == sample_submission[i,4], 2]
  }
}

Others

I repeat the same exercise for all other stores.

Again I will stick to ETS as my forecasting model of choice.

# Every genre except the four modeled above
other_stores <- setdiff(unique(air_store_info$air_genre_name),
                        c('Izakaya', 'Cafe/Sweets', 'Dining bar', 'Italian/French'))

for (store in other_stores) {
  other <- air_full[air_full$air_genre_name == store,]
  other.ts <- aggregate(visitors ~ visit_day , FUN = sum, data = other)
  
  other.ets.full <- ets(ts(other.ts["visitors"], start = 1, frequency = 7))
  other.fc <- forecast(other.ets.full, h = 39)
  
  other.forecast <- data.frame(date = sample_submission$date[1:39], 
                               fc = as.numeric(other.fc$upper[,2])/nrow(air_store_info[air_store_info$air_genre_name==store,]))
  
  for (i in 1:nrow(sample_submission)) {
  if (sample_submission[i, 5] == store) {
    sample_submission[i, 'visitors'] = other.forecast[other.forecast$date == sample_submission[i,4], 2]
  }
}
  
}

Export

#  write.csv(sample_submission[, c(2,3)], 'submission.csv', row.names = F)

Individual stores

This will take a long time to run, so it is wrapped in quotes and not evaluated by default; it can be run by removing the surrounding quotation marks.

'for (i in unique(sample_submission$air_store_id)) {
  single <- air_visit_data[air_visit_data$air_store_id == i,]
  visit <- ts(single$visitors, frequency = 7)
  single.ets <- ets(visit)
  single.fc <- forecast(single.ets, h = 39)
  single.forecast <- data.frame(date = as.numeric(as.Date(sample_submission$date[1:39])),
                                fc = as.numeric(single.fc$upper[,2]))
  
    for (j in 1:nrow(sample_submission)) {
      if (sample_submission[j,1] == i){
        sample_submission[j,3] <- single.forecast[single.forecast$date == as.numeric(as.Date(sample_submission[j,4])), 2]
      }
    }
}'

Export

 # write.csv(sample_submission[, c(2,3)], 'submission.csv', row.names = F)